Gut microbiome-metabolome interactions predict host condition

Background The effect of microbes on their human host is often mediated through changes in metabolite concentrations. As such, multiple tools have been proposed to predict metabolite concentrations from microbial taxa frequencies. Such tools typically fail to capture the dependence of the microbiome-metabolite relation on the environment. Results We propose to treat the microbiome-metabolome relation as the equilibrium of a complex interaction and to relate the host condition to a latent representation of the interaction between the log concentration of the metabolome and the log frequencies of the microbiome. We develop LOCATE (Latent variables Of miCrobiome And meTabolites rElations), a machine learning tool to predict the metabolite concentration from the microbiome composition and produce a latent representation of the interaction. This representation is then used to predict the host condition. LOCATE's accuracy in predicting the metabolome is higher than that of all current predictors. The metabolite concentration prediction accuracy decreases significantly across datasets and across conditions, especially in 16S data. LOCATE's latent representation predicts the host condition better than either the microbiome or the metabolome. This representation is strongly correlated with host demographics. A significant improvement in accuracy (0.793 vs. 0.724 average accuracy) is obtained even with a small number of metabolite samples (∼50). Conclusion These results suggest that a latent representation of the microbiome-metabolome interaction leads to a better association with the host condition than either of the two separately or a simple combination of the two.
Supplementary Information The online version contains supplementary material available at 10.1186/s40168-023-01737-1.


ML nomenclature
In order to facilitate the understanding of the more mathematical and Machine Learning (ML) oriented terms in the text, we provide a short description of the main ML terms used in the manuscript.
• Model is the mathematical relation between any input (in our case, microbiome ASVs, metabolites, or LOCATE's representation, Z) and the appropriate output (in our case, the class of the sample or the phenotype). In ML, the model usually contains a set of parameters called weights, and training finds the weights for which the model is in best agreement with the relation between the input and output in the "Training set".
• Training set The part of the data used to train the model. The quality of the fit between the input and output data on the training set is not a good measure of the quality of the model, since the fit may be an "overfit".
• Overfitting A problem that occurs when a model produces good results on data in the training set (usually due to too many parameters), but produces poor results on unseen data.
• Validation set is a set separate from the training set that is used to monitor the training process but is not used for the training itself. This set can be used to optimize some parts of the learning process, including setting the "hyperparameters".
• Model hyperparameters are adjustable values that are not considered part of the model itself in that they are not updated during training, but still have an impact on the training of the model and its performance. To ensure that they are not fitted to maximize the test set performance, the hyperparameters are optimized using an internal validation set.
• Test set Data used to test the model that is not used for either hyperparameter optimization or training. The quality estimated on the test set is the most reliable estimate of the model's accuracy on unseen data.
• k-Fold Cross-Validation (referred to as k CVs) is a resampling procedure used to evaluate machine learning models on a limited data sample. The data is first partitioned into k equally (or nearly equally) sized segments or folds. Subsequently, k iterations of training and validation are performed such that within each iteration a different fold of the data is held out for validation while the remaining k-1 folds are used for training.
• Receiver Operating Characteristic Curve (ROC) is a graph showing the performance of a classification model at all classification thresholds. This curve plots two parameters against each other: the True Positive Rate (TPR, the probability that an actual positive will test positive) and the False Positive Rate (FPR, the probability that an actual negative will test positive).
• Area under the ROC curve (AUC) is a single scalar value that measures the overall performance of a binary classifier. It measures the area under the ROC curve defined above. The AUC value lies within the range [0, 1], where 0.5 represents the performance of a random classifier and 1.0 corresponds to a perfect classifier (i.e., with a classification error rate equivalent to zero). Values below 0.5 indicate predictions that are worse than random.
• Factorization is the process of decomposing a matrix into the product of other smaller matrices.
• Unit vectors Vectors with a norm of one.
• Orthonormal Two vectors in an inner product space are orthonormal if they are orthogonal (perpendicular, meaning their inner product is zero) and each has a norm of 1.
• Singular Value Decomposition (SVD) is the factorization of a matrix A (in our case, the microbiome-metabolite relation matrix) into the product of three matrices U, D and V^T, where the columns of U and V are "orthonormal" and the matrix D is diagonal with non-negative real entries (the singular values). By SVD, one can determine the matrix's rank (the number of non-zero singular values), quantify a linear system's sensitivity to numerical error, or obtain an optimal "low rank approximation" to the matrix.
• Low rank approximation A simplified representation of a matrix obtained by retaining only the most significant components or factors, typically achieved through techniques like Singular Value Decomposition (SVD). Lower-rank approximations can reduce data dimensionality while preserving key information. This process helps improve the generalization ability of models or analyses, making it easier to identify and understand key biological relationships or features.
• Latent representation is the representation of a high-dimensional vector by a lower-dimensional one, obtained with an appropriate model that keeps most of the information.
• Canonical Correlation Analysis (CCA) is a statistical technique used to explore and quantify the relationships between two sets of variables. In simpler terms, CCA helps us understand if there are meaningful connections between two sets of data (in our case, a view (microbiome/metabolites/Z) and host features).
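To make two of the evaluation terms above concrete, the following is a minimal Python sketch of k-fold cross-validation splitting and of the AUC. It is an illustration only and is not taken from the LOCATE code: the function names (k_fold_indices, cross_validation_splits, roc_auc), the shuffling seed, and the toy labels and scores are all assumptions introduced here. The AUC is computed through its standard rank interpretation, i.e., the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, which equals the area under the ROC curve.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and partition them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at i,
    # so fold sizes differ by at most one sample.
    return [indices[i::k] for i in range(k)]


def cross_validation_splits(n_samples, k, seed=0):
    """Yield (training, validation) index lists, holding out one fold per iteration."""
    folds = k_fold_indices(n_samples, k, seed)
    for held_out in range(k):
        validation = folds[held_out]
        training = [idx for j, fold in enumerate(folds) if j != held_out for idx in fold]
        yield training, validation


def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Hypothetical toy data: 10 samples split into 5 folds.
    for training, validation in cross_validation_splits(10, 5):
        print(len(training), "training samples,", len(validation), "validation samples")
    # Hypothetical labels and classifier scores.
    print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

The pairwise AUC computation is quadratic in the number of samples, which is acceptable for a sketch; a sorting-based implementation would be preferable for large cohorts.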

Figure 1: LOCATE can be used to predict metabolites in each dataset separately better than all existing methods. A-E. Comparison between LOCATE and all state-of-the-art metabolite prediction models as well as a Linear network and a Log network over the different datasets FRANZOSA (A), ERAWIJANTARI (B), MARS (C), WANG (D) and YACHIDA (E). F. Comparison between LOCATE and all state-of-the-art metabolite prediction models as well as a Linear network and a Log network over the Kim dataset.

Figure 2: Intersections between pairs of cohorts of 16S and WGS at the order taxonomic level (A), and at the species level (B). The overlap between the pairs of the WGS datasets (red) is much higher than the overlap in the 16S datasets (blue), especially at the species level. The overlap between 16S and 16S is higher than the overlap between 16S and WGS, although the number of taxa in WGS is much higher than in 16S, and one could expect the 16S taxa to be included in the WGS.

Figure 3: Low intersection between the microbiome orders and the metabolites of different cohorts. A-D. Venn diagrams of the microbiome of triads of 16S datasets. E-H. Venn diagrams of the metabolites of triads of 16S datasets. Each color represents a dataset, and the intermediate colors represent the intersection. I. Histogram of average SCCs between each microbe and each metabolite that appears in at least 2 cohorts (of the 16S cohorts). The histogram's peak is at 0.0, which emphasizes the inconsistency of the SCCs across datasets. J. Histogram of the percent of agreement between the correlations reported in the literature and the correlations found in the cohorts. Most of the correlations do not agree with the literature. K. Heatmap of NMF coefficients between microbes and metabolites over different datasets (He, Kim and Jacob) vs. the relations that are reported in the literature. Blue/Red colors represent positive/negative correlations. The relations vary between different datasets and do not preserve the known relations from the literature [1].

Figure 4: Microbiome-metabolite relations are dataset specific. A-C. Swarm plots of the SCCs of LOCATE's predicted metabolites in the cross-times test over the Direct Plus cohort. The dark blue points represent the SCCs of the "in-learning", referred to as "Internal", where only one time point was used for the training and the testing, using the 10 CV approach. The light blue points represent the SCCs of the "ex-learning", referred to as "External", where LOCATE is trained on one time point and is tested on another one. There is a decrease in the accuracy of the ex-learning vs. the in-learning. The stars follow the convention of all other figures. D-F. Swarm plots of all of the cross-dataset learning between pairs of datasets, Kim-Jacob (D), Direct Plus-Kim (E), Direct Plus-Jacob (F). G. Swarm plots of all of the cross-dataset learning between pairs of datasets of the Log network model. The decline in performance between the "in-learning" and "ex-learning" can be seen here, too.

Figure 5: Robustness of host condition prediction models against overfitting. A-C. AUC comparison between training and test sets for binary tasks involving 16S cohorts in microbiome-based models (A), LOCATE models (B), and metabolite-based models (C). D-F. SCC comparison between training and test sets for continuous tasks involving 16S cohorts in microbiome-based models (D), LOCATE models (E), and metabolite-based models (F). G-I. AUC comparison between training and test sets for binary tasks involving WGS cohorts in microbiome-based models (G), LOCATE models (H), and metabolite-based models (I). Dark bars denote training performance, while light bars signify test set performance. The black error bars represent the standard errors within the 10 CVs.

Figure 6: Comparison of various variants of LOCATE: A. Comparison of LOCATE with different normalization strategies for microbiome and metabolites (log and z-scoring) against the variant without normalization. B. Comparison of LOCATE with its second step of Low-Rank Approximation (LAP) against a regular encoder-decoder. C. Comparison of different methods of dimension reduction to create the intermediate representation Z (Fully Connected Network (FCN), 1D Convolutional Neural Network (1D-CNN), deep network with 5 CNN layers) in terms of metabolite prediction performance. D. Comparison of the same dimension reduction methods for the phenotype prediction performance. The black error bars represent the standard errors within the 10 cross-validation runs.

Figure 7: Host condition predictions based on targeted metabolites vs. untargeted metabolites. In each cohort with untargeted metabolites, the condition is predicted once based on models that are trained only on classified metabolites (the dark bars), and once on all the metabolites, including unclassified ones. LOCATE based on untargeted metabolites (light blue) outperforms all the other methods. The performance is measured as the average AUC (for binary phenotypes) or SCC (for continuous phenotypes) on a test set over 10 CVs. The black error bars represent the standard errors within the 10 CVs.

Figure 8: Average coefficients of each metabolite in the real dataset (dark bar) and in the shuffled one (light bar). The black error bars represent the standard errors.

Table 1: Summary of current state-of-the-art methods

Table 4: Metadata of each cohort

Table 8: Clustering components of Fig. 5F, G and H. Each cluster is represented by the 2 colors of its first 2 dimensions.